Bank Customer Churn Prediction


Introduction

Customer churn prediction means identifying which customers are likely to leave or unsubscribe from your service. For many companies this is an important prediction, because acquiring new customers often costs more than retaining existing ones. Once you have identified customers at risk of churn, you need to know which marketing efforts to direct at each of them to maximize their likelihood of staying.

Customers differ in their behaviors, preferences, and reasons for cancelling their subscriptions. It is therefore important to communicate actively with each of them to keep them on your customer list. You need to know which marketing activities are most effective for individual customers, and when.

Impact of customer churn on businesses

A firm with a high churn rate loses many customers, which lowers its growth rate and hurts sales and earnings. Businesses with a low churn rate, by contrast, keep most of their customers.

Why is Analyzing Customer Churn Prediction Important?

Because it costs more to attract new customers than to sell to existing ones, customer churn is a crucial metric that can determine whether a firm succeeds or fails. Effective customer retention raises the average customer lifetime value, increasing the value of all subsequent sales and boosting unit profits.

Increasing income through recurring subscriptions and dependable repeat business is frequently a better use of a company's resources than spending money on attracting new clients. It is a lot simpler to expand and weather financial difficulty if you keep your existing clients than if you must spend money bringing in new ones to replace those who have departed.


Problem Statement

Customer churn, or customer attrition, is the tendency of customers to abandon a brand and stop being paying clients of a particular business or organization. The percentage of customers that discontinue using a company’s services or products during a specific period is called the customer churn rate. Several bad experiences (or just one) can be enough for a customer to quit. If a large share of dissatisfied customers churn in the same period, both the material losses and the damage to reputation would be enormous.

A reputed bank, “ABC BANK”, wants to predict customer churn. The task is to build models using different machine learning approaches and identify the one that gives the best predictions.

Dataset Description

This is a public dataset. Its format is described below.

Inside the dataset, there are 10000 rows and 14 different columns.

The target column is Exited.

The details about all the columns are given in the following data dictionary -

| Variable | Definition |
| --- | --- |
| RowNumber | Unique Row Number |
| CustomerId | Unique Customer Id |
| Surname | Surname of a customer |
| CreditScore | Credit Score of each Customer |
| Geography | Geographical Location of Customers |
| City_Category | Category of the City (A,B,C) |
| Gender | Sex of Customers |
| Age | Age of Each Customer |
| Tenure | Number of years |
| Balance | Current Balance of Customers |
| NumOfProducts | Number of Products |
| HasCrCard | If a customer has a credit card or not |
| IsActiveMember | If a customer is active or not |
| EstimatedSalary | Estimated Salary of each Customer |
| Exited | Customer left the bank or Not (Target Variable) |

Importing Libraries

Importing the Dataset

Summary of Data

Total Unique value

Total Missing values

Deleting Unnecessary Information

The columns RowNumber, CustomerId and Surname only identify individual customers. They have no quantitative impact on any of the calculations.

Hence, we remove these columns from the data.
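As a rough illustration of this step, the sketch below drops the three identifier columns from a tiny in-memory sample using only the Python standard library. The sample values, the `drop_columns` helper and the `ID_COLUMNS` name are assumptions made for the example; only the column names come from the data dictionary. With pandas, the equivalent would typically be `df.drop(columns=[...])`.

```python
import csv
import io

# Tiny made-up sample; only the column names come from the data dictionary.
SAMPLE_CSV = """RowNumber,CustomerId,Surname,CreditScore,Age,Exited
1,1001,Smith,619,42,1
2,1002,Jones,608,41,0
"""

# Identifier columns that carry no signal for the model.
ID_COLUMNS = {"RowNumber", "CustomerId", "Surname"}


def drop_columns(rows, columns):
    """Return copies of the rows without the given columns."""
    return [{k: v for k, v in row.items() if k not in columns} for row in rows]


rows = list(csv.DictReader(io.StringIO(SAMPLE_CSV)))
cleaned = drop_columns(rows, ID_COLUMNS)
print(list(cleaned[0].keys()))
```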

Exploratory Data Analysis

Data Visualization

About 20% of the customers have churned. A naive baseline that predicts "no churn" for every customer would therefore already be about 80% accurate, so any useful model has to beat that figure.
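The arithmetic behind such a baseline can be sketched as follows. The label list is illustrative (ten customers with a 20% churn share), and the function names are hypothetical rather than taken from the project code.

```python
from collections import Counter


def churn_rate(labels):
    """Fraction of customers whose Exited label is 1."""
    return sum(labels) / len(labels)


def majority_baseline_accuracy(labels):
    """Accuracy obtained by always predicting the most common class."""
    counts = Counter(labels)
    return max(counts.values()) / len(labels)


# Illustrative labels: 2 churned customers out of 10.
exited = [1] * 2 + [0] * 8

print(churn_rate(exited))
print(majority_baseline_accuracy(exited))
```

Because the classes are imbalanced, a model's accuracy is only meaningful when compared against this majority-class figure.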

Data Preprocessing / Data Preparation

This will allow us to include a negative relation in the modeling.

Scaling

Machine Learning Models

Modeling

1. Logistic Regression classifier

2. Random Forest Classifier

3. Support Vector Machines

3.1 Support Vector Machines with RBF kernel

3.2 Support Vector Machines with Poly kernel

4. Stochastic Gradient Descent (SGD) classifier

5. Extreme Gradient Boosting (XGB) classifier

Receiver Operating Characteristic (ROC)

Let us now try to use this model with our test data and see how it works out.

Deep Learning Models

PreProcessing Data

One-hot Encoding
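One-hot encoding turns a categorical column such as Geography or Gender into 0/1 indicator columns. The sketch below is a plain-Python illustration, not the project's actual code: the `one_hot` helper and the sample country names are assumptions made for the example. With pandas, `pandas.get_dummies` is the usual equivalent.

```python
def one_hot(values):
    """Return (categories, rows); each row holds one 0/1 flag per category."""
    categories = sorted(set(values))
    rows = [[1 if value == category else 0 for category in categories]
            for value in values]
    return categories, rows


# Hypothetical values for the Geography column.
geography = ["France", "Spain", "Germany", "France"]

categories, encoded = one_hot(geography)
print(categories)
for row in encoded:
    print(row)
```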

Splitting the dataset

Scaling

1. Neural Network classifier

2. Neural Network classifier with 1 Hidden layer - with Early Stopping

3. Neural Network Architecture with multiple Hidden layers

4. Neural Network Architecture with Early Stopping
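Early stopping halts training once the validation loss has stopped improving for a set number of epochs. The following is a minimal, framework-free sketch of that rule. The function name, the `patience` value and the loss values are hypothetical and do not come from the project's training runs.

```python
def early_stopping_epoch(val_losses, patience=2):
    """Return the index of the epoch at which training would stop.

    Training stops once the validation loss has failed to improve on the
    best value seen so far for `patience` consecutive epochs. If that never
    happens, the index of the last epoch is returned.
    """
    best = float("inf")
    waited = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best:
            best = loss
            waited = 0
        else:
            waited += 1
            if waited >= patience:
                return epoch
    return len(val_losses) - 1


# Hypothetical validation losses, one per epoch.
losses = [0.50, 0.45, 0.43, 0.44, 0.46, 0.42]
stop = early_stopping_epoch(losses, patience=2)
print(stop)
```

In this sample the loss stops improving after the third epoch, so with a patience of 2 the later improvement in the last epoch is never reached. That is the usual trade-off when choosing the patience value.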

Conclusion

Churn prediction helps management develop retention strategies that keep loyal customers. For churn analytics, it is important to keep the false positive rate low. A false positive means the system predicts that a customer will churn when the customer would in fact stay. As a result, the bank may face losses from wrong predictions, because promotions are given to customers who were not actually going to leave. Precision is therefore the most important evaluation metric here, as high precision corresponds to a low false positive rate.
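To make the link between false positives and precision concrete, here is a small sketch that counts the four confusion-matrix cells and derives precision and the false positive rate from them. The labels and predictions are made up for illustration, and the helper names are hypothetical.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Count true/false positives and negatives for the given positive class."""
    tp = fp = tn = fn = 0
    for actual, predicted in zip(y_true, y_pred):
        if predicted == positive:
            if actual == positive:
                tp += 1
            else:
                fp += 1  # predicted churn, but the customer stayed
        else:
            if actual == positive:
                fn += 1  # missed a customer who churned
            else:
                tn += 1
    return tp, fp, tn, fn


def precision(tp, fp):
    """Share of predicted churners who actually churned."""
    return tp / (tp + fp) if (tp + fp) else 0.0


def false_positive_rate(fp, tn):
    """Share of retained customers wrongly flagged as churners."""
    return fp / (fp + tn) if (fp + tn) else 0.0


# Made-up labels (1 = exited, 0 = retained) and model predictions.
y_true = [1, 1, 0, 0, 0, 0, 0, 0, 1, 0]
y_pred = [1, 0, 1, 0, 0, 0, 0, 0, 1, 0]

tp, fp, tn, fn = confusion_counts(y_true, y_pred)
print(tp, fp, tn, fn)
print(precision(tp, fp), false_positive_rate(fp, tn))
```

Every false positive lowers precision and raises the false positive rate at the same time, which is why the two move together in the discussion above.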

Accuracy

Accuracy is the most intuitive performance measure: it is simply the ratio of correctly predicted observations to the total number of observations. The Random Forest Classifier has the highest testing accuracy of all the models, at 89.53%.

| Models | Training Accuracy Score | Testing Accuracy Score |
| --- | --- | --- |
| Logistic Regression | 88.00 % | 81.41 % |
| Random Forest | 94.42 % | 89.53 % |
| Support Vector Machine (SVM) (RBF kernel) | 90.00 % | 86.00 % |
| Support Vector Machine (SVM) (Poly kernel) | 91.53 % | 85.32 % |
| Stochastic Gradient Descent (SGD) | 90.47 % | 81.78 % |
| XGBoost Classifier | 96.45 % | 88.81 % |
| Neural Network classifier | 89.67 % | 84.35 % |
| Neural Network classifier with Early Stopping | 89.67 % | 84.35 % |
| Neural Network Architecture with multiple Hidden layers | 86.67 % | 86.20 % |
| Neural Network Architecture with Early Stopping | 86.65 % | 86.23 % |

Precision

Precision is the ratio of correctly predicted positive observations to the total predicted positive observations. The Random Forest Classifier has the highest precision of all the models, approximately 90% for retained customers and 89% for exited customers, which is pretty good. High precision relates to a low false positive rate.

| Models | Precision for Retained (0) | Precision for Exited (1) |
| --- | --- | --- |
| Logistic Regression | 82.00 % | 69.00 % |
| Random Forest | 90.00 % | 89.00 % |
| Support Vector Machine (SVM) (RBF kernel) | 86.00 % | 84.00 % |
| Support Vector Machine (SVM) (Poly kernel) | 86.00 % | 81.00 % |
| Stochastic Gradient Descent (SGD) | 83.00 % | 67.00 % |
| XGBoost Classifier | 89.00 % | 85.00 % |
| Neural Network classifier | 87.00 % | 69.00 % |
| Neural Network classifier with Early Stopping | 87.00 % | 69.00 % |
| Neural Network Architecture with multiple Hidden layers | 88.00 % | 75.00 % |
| Neural Network Architecture with Early Stopping | 87.00 % | 83.00 % |

Recall

Recall is the ratio of correctly predicted positive observations to all observations in the actual class. The Random Forest Classifier is among the models with the highest recall, at approximately 98% for retained customers and 55% for exited customers.

| Models | Recall for Retained (0) | Recall for Exited (1) |
| --- | --- | --- |
| Logistic Regression | 98.00 % | 15.00 % |
| Random Forest | 98.00 % | 55.00 % |
| Support Vector Machine (SVM) (RBF kernel) | 98.00 % | 38.00 % |
| Support Vector Machine (SVM) (Poly kernel) | 98.00 % | 36.00 % |
| Stochastic Gradient Descent (SGD) | 98.00 % | 19.00 % |
| XGBoost Classifier | 98.00 % | 55.00 % |
| Neural Network classifier | 95.00 % | 43.00 % |
| Neural Network classifier with Early Stopping | 95.00 % | 43.00 % |
| Neural Network Architecture with multiple Hidden layers | 96.00 % | 48.00 % |
| Neural Network Architecture with Early Stopping | 98.00 % | 43.00 % |
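The recall figures are reported separately for each class. A minimal sketch of how per-class recall can be computed is shown below. The function name and the sample labels are illustrative assumptions, not the project's evaluation code.

```python
def recall_for_class(y_true, y_pred, cls):
    """Share of customers actually in `cls` that were also predicted as `cls`."""
    predictions_for_cls = [p for t, p in zip(y_true, y_pred) if t == cls]
    if not predictions_for_cls:
        return 0.0
    hits = sum(1 for p in predictions_for_cls if p == cls)
    return hits / len(predictions_for_cls)


# Made-up labels: 8 retained customers (0) and 2 exited customers (1).
y_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
y_pred = [0, 0, 0, 0, 0, 0, 0, 0, 1, 0]

print(recall_for_class(y_true, y_pred, 0))
print(recall_for_class(y_true, y_pred, 1))
```

A model can reach very high recall on the majority class while still missing many churners, which matches the gap between the two recall columns.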

F1 Score

F1 Score is the harmonic mean of Precision and Recall, so it takes both false positives and false negatives into account. The Random Forest Classifier has the highest F1 score of all the models, approximately 94% for retained customers and 68% for exited customers.

| Models | F1 Score for Retained (0) | F1 Score for Exited (1) |
| --- | --- | --- |
| Logistic Regression | 89.00 % | 25.00 % |
| Random Forest | 94.00 % | 68.00 % |
| Support Vector Machine (SVM) (RBF kernel) | 92.00 % | 53.00 % |
| Support Vector Machine (SVM) (Poly kernel) | 91.00 % | 50.00 % |
| Stochastic Gradient Descent (SGD) | 90.00 % | 30.00 % |
| XGBoost Classifier | 93.00 % | 66.00 % |
| Neural Network classifier | 91.00 % | 53.00 % |
| Neural Network classifier with Early Stopping | 91.00 % | 53.00 % |
| Neural Network Architecture with multiple Hidden layers | 92.00 % | 59.00 % |
| Neural Network Architecture with Early Stopping | 92.00 % | 57.00 % |
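As a quick consistency check, the F1 score can be recomputed as the harmonic mean of precision and recall. The sketch below uses the rounded Exited (1) precision and recall values reported above for Random Forest and Logistic Regression. The function name is hypothetical, and because the inputs are already rounded, results for other models may differ from the reported F1 by about a percentage point.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (0.0 if both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Rounded precision and recall for the Exited (1) class.
rf_f1 = f1_score(0.89, 0.55)  # Random Forest
lr_f1 = f1_score(0.69, 0.15)  # Logistic Regression

print(round(rf_f1, 2), round(lr_f1, 2))
```

The harmonic mean is pulled towards the smaller of the two inputs, which is why the low recall on exited customers keeps the F1 score for that class well below the precision.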